In [2]:
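# One-time setup: install the packages used in this tutorial.
install.packages("tidyr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("gtable")
install.packages("gamlss")
install.packages("mclust")
install.packages("igraph")
install.packages("devtools")
install.packages("conflicted")
# scico is attached in the Libraries section and provides the color palette for the plot.
install.packages("scico")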



Mutations in the genome can alter the sequence so that non-coding sections become coding, e.g. through the introduction of a start codon or the alteration of a stop codon. To identify these non-canonical genomic products, protein databases that capture genetic variation and non-canonical products are generated, either by enriching canonical protein sequences or by running a six-reading-frame translation of the entire genome (1).
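
To make the six-reading-frame idea concrete, the sketch below splits a DNA string into codons for the three offsets on each strand. It is a minimal base R illustration, not the procedure used to build the databases in (1): the sequence `dna` and the helpers `reverse_complement()` and `reading_frames()` are hypothetical, and translation to amino acids is left out.

```r
# Hypothetical toy sequence, not taken from the data set.
dna <- "ATGGCCTGAAAC"

# Reverse complement of a DNA string (upper-case A, C, G, T only).
reverse_complement <- function(x) {
    complemented <- chartr("ACGT", "TGCA", x)
    paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Split a DNA string into codons for the three offsets on both strands.
reading_frames <- function(x) {
    strands <- c(forward = x, reverse = reverse_complement(x))
    frames <- list()
    for (strand in names(strands)) {
        for (offset in 0:2) {
            # Drop the first 'offset' bases, then cut into complete triplets.
            s <- substring(strands[[strand]], offset + 1)
            starts <- seq(1, by = 3, length.out = nchar(s) %/% 3)
            frames[[paste0(strand, "_", offset + 1)]] <- substring(s, starts, starts + 2)
        }
    }
    frames
}

frames <- reading_frames(dna)
print(frames)
```

Every frame yields its own set of candidate sequences, which is one reason why such databases are much larger than canonical ones.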

Based on your knowledge of peptide and protein identification, can you anticipate challenges posed by these proteogenomic databases?

Libraries

We will need the following libraries. Please make sure that they are installed.
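
One possible way to check the installation before attaching anything is sketched below. The variable names are hypothetical, and the package list simply mirrors the `library()` calls of this section.

```r
# Packages attached in this tutorial.
requiredPackages <- c("tidyr", "dplyr", "ggplot2", "scico")

# requireNamespace() returns FALSE instead of failing when a package is missing.
isInstalled <- vapply(
    requiredPackages,
    requireNamespace,
    FUN.VALUE = logical(1),
    quietly = TRUE
)

missingPackages <- requiredPackages[!isInstalled]

if (length(missingPackages) > 0) {
    message("Missing packages: ", paste(missingPackages, collapse = ", "))
} else {
    message("All required packages are installed.")
}
```

Any package reported as missing can then be installed with `install.packages()`.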


In [2]:
library(tidyr)
library(dplyr)
library(ggplot2)
library(scico)

theme_set(theme_bw(base_size = 11))


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Data set

In this tutorial, we will analyze the non-canonical genomic products identified in breast cancer by Johansson et al. (2). Note that this tutorial does not cover database generation, the search, or the validation of identification results. These bioinformatic procedures are very demanding, and we strongly advise that you make sure they are in place at your lab or at the facility processing the data before conducting any proteogenomic experiment. The proteogenomic identification results by Johansson et al. (2) are reported in Supplementary Data 6, available here in the course repository.

What do the different columns in the table represent?

For this tutorial, the Novel Peptides table was extracted to an R-friendly text format, and is available in resources/data/novel_peptides.gz.

👨‍💻 Load the data in R as in the code below.

In [3]:
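# Read the tab-separated table; read.table() decompresses the gzip file itself.
# Comment and quote characters are disabled so that '#' or quotes in a field
# do not truncate or merge lines.
novelPeptidesDF <- read.table(
    file = "resources/data/novel_peptides.gz",
    header = TRUE,
    sep = "\t",
    comment.char = "",
    quote = "",
    stringsAsFactors = FALSE
)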
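
A quick look at the structure of a freshly loaded table helps with the question on the columns. The sketch below runs on a hypothetical toy table written to a temporary gzip file, so that it does not depend on the course repository. The `peptide` column name and all values are assumptions made for illustration.

```r
# Hypothetical toy table written to a temporary gzip file.
toyFile <- tempfile(fileext = ".gz")
toyDF <- data.frame(
    peptide = c("PEPTIDEA", "PEPTIDEB"),
    class = c("intergenic", "intronic"),
    ms1_area = c(1.5e6, 2.3e7),
    stringsAsFactors = FALSE
)
con <- gzfile(toyFile, open = "wt")
write.table(toyDF, file = con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Same arguments as for the real table.
loadedDF <- read.table(
    file = toyFile,
    header = TRUE,
    sep = "\t",
    comment.char = "",
    quote = "",
    stringsAsFactors = FALSE
)

# Column names, types, and the first rows.
str(loadedDF)
head(loadedDF)
```

Running `str(novelPeptidesDF)` and `head(novelPeptidesDF)` on the real table gives the same kind of overview.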

Genomic context and function

👨‍💻 Find the different classes of loci represented.

In [5]:
classesDF <- as.data.frame(
    table(
        novelPeptidesDF$class
    )
) %>%
    rename(
        class = Var1,
        n_peptides = Freq
    ) %>%
    arrange(
        desc(n_peptides)
    )

print(classesDF)


                     class n_peptides
1               intergenic        172
2                 intronic         91
3           ncRNA_intronic         22
4             ncRNA_exonic         18
5                   exonic         17
6                     UTR5         16
7              UTR5-exonic         14
8              exonic-UTR5         12
9          intronic-exonic         10
10         exonic-intronic          4
11                upstream          3
12     exonic-ncRNA_exonic          2
13         exonic-splicing          1
14         exonic-upstream          1
15 intergenic-ncRNA_exonic          1
16     ncRNA_exonic-exonic          1
17       splicing-intronic          1
18                    UTR3          1
19           UTR5-upstream          1
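
The same kind of tally can also be obtained with `dplyr::count()`, which avoids the renaming step. The sketch below is an alternative on hypothetical toy data, not the code used to produce the table above. The `share` column is an addition for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for novelPeptidesDF.
toyDF <- data.frame(
    class = c("intergenic", "intronic", "intergenic", "UTR5", "intergenic", "intronic"),
    stringsAsFactors = FALSE
)

classesToyDF <- toyDF %>%
    count(
        class,
        name = "n_peptides"
    ) %>%
    arrange(
        desc(n_peptides)
    ) %>%
    mutate(
        # Fraction of all peptides falling in each class.
        share = n_peptides / sum(n_peptides)
    )

print(classesToyDF)
```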

Can you speculate on how these different classes of loci can yield novel peptides?

👨‍💻 For the non-intergenic peptides, find the different categories of associated genes.

In [9]:
geneDF <- novelPeptidesDF %>% 
    filter(
        class != "intergenic"
    ) %>%
    select(
        nearest_gene, category
    )

categoriesDF <- as.data.frame(
    table(
        geneDF$category
    )
) %>%
    rename(
        category = Var1,
        n_peptides = Freq
    ) %>%
    arrange(
        desc(n_peptides)
    )

print(categoriesDF)


        category n_peptides
1     pseudogene         93
2           5UTR         43
3       intronic         33
4 exonic.Alt.ORF         18
5 exon_extension         14
6          ncRNA         12
7     intergenic          2
8           3UTR          1
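
To see how the locus classes relate to the gene categories, the two columns can be cross-tabulated. The sketch below uses base R on hypothetical toy data. The class and category pairings shown are made up for illustration and are not taken from the data set.

```r
# Hypothetical toy data; the pairings are invented for illustration.
toyDF <- data.frame(
    class = c("intronic", "intronic", "UTR5", "ncRNA_exonic", "UTR5"),
    category = c("intronic", "pseudogene", "5UTR", "pseudogene", "5UTR"),
    stringsAsFactors = FALSE
)

# Rows are locus classes, columns are gene categories.
crossTable <- table(
    class = toyDF$class,
    category = toyDF$category
)

print(crossTable)
```

On the real table, the same call with `novelPeptidesDF$class` and `novelPeptidesDF$category` shows which classes feed into each category.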

What do these categories represent?

👨‍💻 Select peptides from the different classes and categories, and inspect the genetic landscape at these positions using the Ensembl or UCSC genome browsers. Example for locus #9 (Intronic).

👨‍💻 Select genes possibly influenced by the transcription/translation change and inspect the function of the associated proteins.

Abundance in tumors and normal tissue

In Supplementary Table 8, the authors provide the abundance for novel peptides monitored in normal tissue and tumors for five patients. The table was extracted to an R-friendly text format for this tutorial, and is available in resources/data/novel_peptides_paired.gz.

👨‍💻 Load the data in R as done for the previous table. In addition, transform the data from wide to long format and create columns indicating the patient number, whether the sample is Control or Tumor, and what kind of tumor, as done in the code below.

In [10]:
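# Load the paired abundances and reshape them so that each row holds the
# control and tumor intensity of one peptide in one patient.
novelPeptidesDF <- read.table(
    file = "resources/data/novel_peptides_paired.gz",
    header = TRUE,
    sep = "\t",
    comment.char = "",
    quote = "",
    stringsAsFactors = FALSE
) %>% 
    gather(
        "Control_1", "Control_2", "Control_3", "Control_4", "Control_5",
        key = "control_id",
        value = "control"
    ) %>% 
    mutate(
        # Keep the patient number of the control sample for the pairing below.
        controlPatient = sub("^Control_", "", control_id)
    ) %>%
    select(
        -control_id
    ) %>%
    gather(
        "LumA_1", "Her2_2", "LumB_3", "Basal_4", "Her2_5",
        key = "tumor_id",
        value = "tumor"
    ) %>%
    separate(
        col = "tumor_id",
        into = c("tumorType", "patientNumber"),
        sep = "_"
    ) %>%
    filter(
        # Two successive gather() calls pair every control with every tumor;
        # keep only the pairs coming from the same patient.
        patientNumber == controlPatient
    ) %>%
    select(
        -controlPatient
    ) %>%
    mutate(
        patientId = paste("Patient", patientNumber)
    ) %>%
    arrange(
        abs(tumor - control)
    )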
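
The reshaping logic can be checked on a small scale. The sketch below is an alternative formulation using `pivot_longer()` and `pivot_wider()` (available in tidyr 1.0.0 and later) on a hypothetical toy table with two peptides and two patients. The `peptide` column name and all intensities are assumptions made for illustration.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy table: two peptides, two patients.
toyDF <- data.frame(
    peptide = c("PEPTIDEA", "PEPTIDEB"),
    Control_1 = c(10, 20),
    Control_2 = c(11, 21),
    LumA_1 = c(30, 40),
    Her2_2 = c(31, 41),
    stringsAsFactors = FALSE
)

# One row per peptide and sample; the column name is split at the underscore.
longDF <- toyDF %>%
    pivot_longer(
        cols = -peptide,
        names_to = c("sampleType", "patientNumber"),
        names_sep = "_",
        values_to = "intensity"
    ) %>%
    mutate(
        tissue = ifelse(sampleType == "Control", "control", "tumor")
    )

# Tumor type of each patient, taken from the tumor column names.
tumorTypesDF <- longDF %>%
    filter(tissue == "tumor") %>%
    transmute(patientNumber, tumorType = sampleType) %>%
    distinct()

# One row per peptide and patient, with control and tumor side by side.
pairedDF <- longDF %>%
    select(-sampleType) %>%
    pivot_wider(
        names_from = tissue,
        values_from = intensity
    ) %>%
    left_join(tumorTypesDF, by = "patientNumber")

print(pairedDF)
```

With two peptides and two patients, a correctly paired table has four rows. A larger row count would indicate that samples from different patients have been combined.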
👨‍💻 Plot the peptide abundance in the tumor vs control tissue for all peptides and all patients.

In [11]:
ggplot(
    data = novelPeptidesDF
) +
    # Dotted lines mark the 20th and 80th percentiles of the tumor intensities.
    geom_hline(
        yintercept = quantile(
            x = novelPeptidesDF$tumor, 
            probs = c(0.2, 0.8),
            na.rm = TRUE
        ),
        col = "black",
        linetype = "dotted"
    ) +
    # Same percentiles for the control intensities.
    geom_vline(
        xintercept = quantile(
            x = novelPeptidesDF$control, 
            probs = c(0.2, 0.8),
            na.rm = TRUE
        ),
        col = "black",
        linetype = "dotted"
    ) +
    geom_point(
        mapping = aes(
            x = control,
            y = tumor,
            col = log10(ms1_area)
        )
    ) +
    facet_grid(
        tumorType ~ .
    ) +
    scale_x_log10(
        name = "Intensity in control tissue"
    ) +
    scale_y_log10(
        name = "Intensity in tumor tissue"
    ) + 
    scale_color_scico(
        name = "MS1 Area [log10]",
        palette = "batlow"
    ) +
    theme(
        legend.position = "top",
        panel.grid = element_blank()
    )


💬 How do you interpret this plot?
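
A numerical summary can support the visual interpretation. The sketch below computes the log2 ratio of tumor to control intensity and summarizes it per tumor type. It runs on hypothetical toy values, and the column names `log2Ratio` and `medianLog2Ratio` are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy values standing in for the paired intensities.
toyDF <- data.frame(
    tumorType = c("LumA", "LumA", "Her2", "Her2"),
    control = c(100, 200, 50, NA),
    tumor = c(400, 100, 200, 80),
    stringsAsFactors = FALSE
)

summaryDF <- toyDF %>%
    filter(
        # Peptides missing in either tissue cannot be compared.
        !is.na(control), !is.na(tumor)
    ) %>%
    mutate(
        # Positive values: higher in tumor; negative values: higher in control.
        log2Ratio = log2(tumor / control)
    ) %>%
    group_by(
        tumorType
    ) %>%
    summarise(
        n = n(),
        medianLog2Ratio = median(log2Ratio)
    )

print(summaryDF)
```

Peptides with a missing intensity in one tissue are dropped here, although a peptide detected only in tumors may be of particular interest and deserves a separate look.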

Conclusion

💬 Can you speculate on the function or effect of these novel peptides in cancer biology? How can these be used in a clinical setup?
